Goto

Collaborating Authors

 multi-modal model




Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning

Neural Information Processing Systems

During optimization, contrastive learning keeps the different modalities separated by a certain distance, which is influenced by the temperature parameter in the loss function. Our experiments further demonstrate that varying the modality gap distance has a significant impact in improving the model's downstream zero-shot classification performance and fairness.


M4I: Multi-modalModels Membership Inference

Neural Information Processing Systems

Compared with the existing membership inference against machine learning classifiers, we focus on the problem that the input and output of the multi-modal models are in different modalities, such as image captioning.




Multi-modal Bayesian Neural Network Surrogates with Conjugate Last-Layer Estimation

arXiv.org Machine Learning

As data collection and simulation capabilities advance, multi-modal learning, the task of learning from multiple modalities and sources of data, is becoming an increasingly important area of research. Surrogate models that learn from data of multiple auxiliary modalities to support the modeling of a highly expensive quantity of interest have the potential to aid outer loop applications such as optimization, inverse problems, or sensitivity analyses when multi-modal data are available. We develop two multi-modal Bayesian neural network surrogate models and leverage conditionally conjugate distributions in the last layer to estimate model parameters using stochastic variational inference (SVI). We provide a method to perform this conjugate SVI estimation in the presence of partially missing observations. We demonstrate improved prediction accuracy and uncertainty quantification compared to uni-modal surrogate models for both scalar and time series data.



A Foundational Multi-Modal Model for Few-Shot Learning

arXiv.org Artificial Intelligence

Few-shot learning (FSL) is a machine learning paradigm that aims to generalize models from a small number of labeled examples, typically fewer than 10 per class. FSL is particularly crucial in biomedical, environmental, materials, and mechanical sciences, where samples are limited and data collection is often prohibitively costly, time-consuming, or ethically constrained. In this study, we present an innovative approach to FSL by demonstrating that a Large Multi-Modal Model (LMMM), trained on a set of independent tasks spanning diverse domains, task types, and input modalities, can substantially improve the generalization of FSL models, outperforming models based on conventional meta-learning on tasks of the same type. To support this, we first constructed a Multi-Modal Model Few-shot Dataset (M3FD, over 10K+ few-shot samples), which includes 2D RGB images, 2D/3D medical scans, tabular and time-course datasets, from which we manually curated FSL tasks such as classification. We further introduced M3F (Multi-Modal Model for Few-shot learning framework), a novel Large Multi-Modal Model framework tailored for data-constrained scientific applications. M3F supports a wide range of scientific data types through a modular pipeline. By fine-tuning the model on M3FD, M3F improves model performance, making LMMM feasible for real-world FSL deployment. The source code is located at https://github.com/ptdang1001/M3F. To democratize access to complex FSL data and promote reproducibility for public usage, M3FD is paired with a flexible and user-friendly tool that enables efficient querying, task-specific sampling, and preprocessing. Together, our dataset and framework offer a unified, scalable solution that significantly lowers the barrier to applying LMMMs in data-scarce scientific domains.


(Almost) Free Modality Stitching of Foundation Models

arXiv.org Artificial Intelligence

Foundation multi-modal models are often designed by stitching of multiple existing pretrained uni-modal models: for example, an image classifier with an text model. This stitching process is performed by training a connector module that aims to align the representation spaces of these uni-modal models towards a multi-modal objective. However, given the complexity of training such connectors on large scale web-based datasets coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal models selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for $N \times M$ combinations of uni-modal models. In our experiments, Hyma reduces the cost of searching for the best performing uni-modal model pair by $10\times$, while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.